{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "herbal-royalty",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Copyright 2021 NVIDIA Corporation. All Rights Reserved.\n",
    "#\n",
    "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
    "# you may not use this file except in compliance with the License.\n",
    "# You may obtain a copy of the License at\n",
    "#\n",
    "#     http://www.apache.org/licenses/LICENSE-2.0\n",
    "#\n",
    "# Unless required by applicable law or agreed to in writing, software\n",
    "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
    "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
    "# See the License for the specific language governing permissions and\n",
    "# limitations under the License.\n",
    "# =============================================================================="
   ]
  },
  {
   "cell_type": "markdown",
   "id": "norwegian-dakota",
   "metadata": {},
   "source": [
    "<img src=\"http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png\" style=\"width: 90px; float: right;\">\n",
    "\n",
    "# BERT QA Inference with TensorRT FP16\n",
    "\n",
    "\n",
    "Bidirectional Encoder Representations from Transformers ([BERT](https://arxiv.org/abs/1810.04805)) is a method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. \n",
    "\n",
    "BERT provided a leap in accuracy for NLU tasks that brought high-quality language-based services within the reach of companies across many industries. To use the model in production, you need to consider factors such as latency, in addition to accuracy, which influences end user satisfaction with a service. BERT requires significant compute during inference due to its 12/24-layer stacked multi-head attention network. This has posed a challenge for companies to deploy BERT as part of real-time applications until now.\n",
    "\n",
    "NVIDIA® [TensorRT](https://developer.nvidia.com/tensorrt)™ is an SDK for high-performance deep learning inference. TensorRT provides INT8 and FP16 optimizations for production deployments of deep learning inference applications such as video streaming, speech recognition, recommendation, fraud detection, and natural language processing.\n",
    "TensorRT optimizations for BERT allows you to perform inference in 2.2 ms on T4 GPUs. This is 17x faster than CPU-only platforms and is well within the 10ms latency budget necessary for conversational AI applications. These optimizations make it practical to use BERT in production, for example, as part of a conversation AI service.\n",
    "\n",
    "This notebook demonstrates the inference of BERT models for question and answering applications with TensorRT in FP16 mode.\n",
    "\n",
    "## Pre-requisite\n",
    "Follow the instruction at https://github.com/NVIDIA/TensorRT to build the TensorRT-OSS docker container required to run this notebook.\n",
    "\n",
    "\n",
    "## Content\n",
    "1. [Download data and model](#1)\n",
    "1. [Building a FP16 TensorRT optimized BERT model](#2)\n",
    "1. [Running inference examples](#3)\n",
    "1. [Inference benchmarking](#4)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "nasty-frequency",
   "metadata": {},
   "source": [
    "<a id=\"1\"></a>\n",
    "\n",
    "## 1. Download data and model\n",
    "First, we download the \n",
    "Stanford Question Answering Dataset ([SQuAD](https://rajpurkar.github.io/SQuAD-explorer/)) and a pre-trained BERT QA model from the NVIDIA GPU Cloud ([NGC](https://ngc.nvidia.com/catalog/models/nvidia:bert_pyt_ckpt_base_qa_squad11_amp)).\n",
    "### SQuAD dataset\n",
    "\n",
    "Stanford Question Answering Dataset ([SQuAD](https://rajpurkar.github.io/SQuAD-explorer/)) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "studied-sheffield",
   "metadata": {},
   "outputs": [],
   "source": [
    "!bash ../scripts/download_squad.sh"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "laughing-arthur",
   "metadata": {},
   "source": [
    "### Fine-tuned BERT Large Model download\n",
    "\n",
    "Many AI applications have common needs: classification, object detection, language translation, text-to-speech, recommender engines, sentiment analysis, and more. When developing applications with these capabilities, it is much faster to start with a model that is pre-trained and then tune it for a specific use case. The NGC [catalog](https://ngc.nvidia.com/catalog/models) offers pre-trained models for a variety of common AI tasks that are optimized for NVIDIA Tensor Core GPUs, and can be easily re-trained by updating just a few layers, saving valuable time.\n",
    "\n",
    "Herein, we download a pretrained, fine-tuned BERT large model, trained with automatic mixed precision, from NGC."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "signed-symposium",
   "metadata": {},
   "outputs": [],
   "source": [
    "!bash ../scripts/download_model.sh large 384 v2"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cc5159a2",
   "metadata": {},
   "source": [
    "### Install extra dependencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2ceba10e",
   "metadata": {},
   "outputs": [],
   "source": [
    "!cd /tmp && git clone https://github.com/vinhngx/transformers && cd transformers && pip install .\n",
    "!pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio===0.8.1 -f https://download.pytorch.org/whl/torch_stable.html"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "alive-slovakia",
   "metadata": {},
   "source": [
    "<a id=\"2\"></a>\n",
    "\n",
    "## 2. Building a FP16 TensorRT optimized BERT model\n",
    "\n",
    "In this section, we will be optimizing the BERT model for inference with TensorRT using FP16. The overal workflow is as below.\n",
    "\n",
    "<img src=\"Figure-1-generating-bert-trt.png\">\n",
    "\n",
    "To optimize BERT with TensorRT, we focused on optimizing the transformer cell. Since several Transformer cells are stacked in BERT, we were able to achieve significant performance gains through this set of optimizations. We use custom plugins that accelerate key operations in the Transformer Encoder elements in a BERT model. The plugins fuse multiple operations into a sub-graph in a single CUDA kernel.  Each sub-graph consists of several elementary computations, each of which requires a read and write to the global memory of the GPU (i.e. the slowest on-device memory).  By fusing the elementary operations together into a single CUDA kernel we allow for the computation to happen on a larger sub-graph while visiting the global memory a minimal amount of times. \n",
    "<img src=\"Figure-5-optimizations-through-trt.jpg\">\n",
    "\n",
    "For more information, see our developer [blog](https://developer.nvidia.com/blog/nlu-with-tensorrt-bert/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "leading-reliance",
   "metadata": {},
   "outputs": [],
   "source": [
    "import tensorrt as trt;\n",
    "TRT_VERSION = trt.__version__\n",
    "\n",
    "print(\"TensorRT version: {}\".format(TRT_VERSION))\n",
    "!mkdir -p engines_$TRT_VERSION"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "studied-profession",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Maximum TensorRT inference batch size\n",
    "BATCH_SIZE = 128"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "unique-batman",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Build BERT TensorRT FP16 model from NGC checkpoint\n",
    "!python3 ../builder.py -m models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_384_v19.03.1/model.ckpt -w 40000 -o engines_$TRT_VERSION/bert_large_384.engine -b $BATCH_SIZE -s 384 --fp16 -c models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_384_v19.03.1"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "authorized-assignment",
   "metadata": {},
   "source": [
    "<a id=\"3\"></a>\n",
    "\n",
    "## 3. Running inference examples\n",
    "\n",
    "Now that we've got a TensorRT engine, the inference workflow using the optimized network is as follows:\n",
    "\n",
    " -   Start the TensorRT runtime with this engine.\n",
    " -   Feed a passage and a question to the TensorRT runtime and receive as output the answer predicted by the network.\n",
    "\n",
    "<img src=\"Figure-2-workflow-to-perform-inference-with-trt.png\">"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "incorporate-psychiatry",
   "metadata": {},
   "outputs": [],
   "source": [
    "PASSAGE = 'TensorRT is a high performance deep learning inference platform that delivers low latency and high throughput for apps'\\\n",
    "'such as recommenders, speech and image/video on NVIDIA GPUs. It includes parsers to import models, and plugins to support novel ops'\\\n",
    "'and layers before applying optimizations for inference. Today NVIDIA is open-sourcing parsers and plugins in TensorRT so that the deep'\\\n",
    "'learning community can customize and extend these components to take advantage of powerful TensorRT optimizations for your apps.'\n",
    "QUESTION=\"What is TensorRT?\"\n",
    "\n",
    "!python3 ../inference.py -e engines_$TRT_VERSION/bert_large_384.engine -s 384 -p $PASSAGE -q $QUESTION -v models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_384_v19.03.1/vocab.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "340fa7fb-d997-4fbf-b3bf-684085584c94",
   "metadata": {},
   "source": [
    "Let's ask a different question. Feel free to plugin your own question."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "legal-brief",
   "metadata": {},
   "outputs": [],
   "source": [
    "QUESTION=\"What is included in TensorRT?\"\n",
    "\n",
    "!python3 ../inference.py -e engines_$TRT_VERSION/bert_large_384.engine -s 384 -p $PASSAGE -q $QUESTION -v models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_384_v19.03.1/vocab.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "korean-simpson",
   "metadata": {},
   "source": [
    "## Validation on the SQuAD dev set\n",
    "Next, we will assess the accuracy of the TensorRT-optimized FP16 BERT model on the SQuAD dev set. \n",
    "\n",
    "There are two dominant metrics used by many question answering datasets, including SQuAD: exact match (EM) and F1 score. These scores are computed on individual question+answer pairs. When multiple correct answers are possible for a given question, the maximum score over all possible correct answers is computed. Overall EM and F1 scores are computed for a model by averaging over the individual example scores.\n",
    "\n",
    "### Exact Match\n",
    "\n",
    "This metric is as simple as it sounds. For each question+answer pair, if the characters of the model's prediction exactly match the characters of (one of) the True Answer(s), EM = 1, otherwise EM = 0. This is a strict all-or-nothing metric; being off by a single character results in a score of 0. When assessing against a negative example, if the model predicts any text at all, it automatically receives a 0 for that example.\n",
    "\n",
    "### F1\n",
    "\n",
    "F1 score is a common metric for classification problems, and widely used in QA. It is appropriate when we care equally about precision and recall. In this case, it's computed over the individual words in the prediction against those in the True Answer. The number of shared words between the prediction and the truth is the basis of the F1 score: precision is the ratio of the number of shared words to the total number of words in the prediction, and recall is the ratio of the number of shared words to the total number of words in the ground truth.\n",
    "\n",
    "For more info, see [reference](https://qa.fastforwardlabs.com/no%20answer/null%20threshold/bert/distilbert/exact%20match/f1/robust%20predictions/2020/06/09/Evaluating_BERT_on_SQuAD.html#Metrics-for-QA).\n",
    "\n",
    "Herein, we verify that the TensorRT model achieves a state-of-the-art accuracy of 90% F1 score on the SQuAD development set."
   ]
  },
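  {
   "cell_type": "markdown",
   "id": "em-f1-sketch",
   "metadata": {},
   "source": [
    "To make the two metrics concrete, here is a minimal sketch of how EM and F1 could be computed for a single prediction. It is an illustration only, not the code of the `evaluate-v1.1.py` script used in this notebook: the function names are hypothetical, and the normalization step (lowercasing, removing punctuation and English articles) is an assumption modeled on common SQuAD evaluation practice.\n",
    "\n",
    "```python\n",
    "import collections\n",
    "import re\n",
    "import string\n",
    "\n",
    "\n",
    "def normalize_answer(text):\n",
    "    # Assumed normalization: lowercase, drop punctuation and articles, collapse whitespace.\n",
    "    text = text.lower()\n",
    "    text = ''.join(ch for ch in text if ch not in string.punctuation)\n",
    "    text = re.sub(r'\\b(a|an|the)\\b', ' ', text)\n",
    "    return ' '.join(text.split())\n",
    "\n",
    "\n",
    "def exact_match(prediction, truth):\n",
    "    # 1 if the normalized strings are identical, otherwise 0.\n",
    "    return int(normalize_answer(prediction) == normalize_answer(truth))\n",
    "\n",
    "\n",
    "def f1_score(prediction, truth):\n",
    "    pred_tokens = normalize_answer(prediction).split()\n",
    "    truth_tokens = normalize_answer(truth).split()\n",
    "    # Multiset intersection: a word counts as shared only as often as it occurs in both.\n",
    "    common = collections.Counter(pred_tokens) & collections.Counter(truth_tokens)\n",
    "    shared = sum(common.values())\n",
    "    if shared == 0:\n",
    "        return 0.0\n",
    "    precision = shared / len(pred_tokens)\n",
    "    recall = shared / len(truth_tokens)\n",
    "    return 2 * precision * recall / (precision + recall)\n",
    "\n",
    "\n",
    "def best_over_truths(metric, prediction, truths):\n",
    "    # With several reference answers, keep the maximum score.\n",
    "    return max(metric(prediction, truth) for truth in truths)\n",
    "\n",
    "\n",
    "print(exact_match('The Eiffel Tower', 'Eiffel Tower'))\n",
    "print(f1_score('in Paris, France', 'Paris'))\n",
    "```\n",
    "\n",
    "The dataset-level EM and F1 are then the averages of these per-example values, using `best_over_truths` whenever a question has more than one reference answer."
   ]
  },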
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "loose-musical",
   "metadata": {},
   "outputs": [],
   "source": [
    "!python3 ../inference.py -e engines_$TRT_VERSION/bert_large_384.engine -s 384 -sq ./squad/dev-v1.1.json -v models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_384_v19.03.1/vocab.txt -o ./predictions-bert_large_384.json\n",
    "!python3 ../squad/evaluate-v1.1.py  squad/dev-v1.1.json  ./predictions-bert_large_384.json 90\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "functional-smile",
   "metadata": {},
   "source": [
    "<a id=\"4\"></a>\n",
    "\n",
    "## 4. Inference benchmarking\n",
    "BERT can be applied both for online and offline use cases. Online NLU applications, such as conversational AI,  place tight latency budgets during inference. Several models need to execute in a sequence in response to a single user query. When used as a service, the total time a customer experiences includes compute time as well as input and output network latency. Longer times lead to a sluggish performance and a poor customer experience.\n",
    "\n",
    "While the exact latency available for a single model can vary by application, several real-time applications need the language model to execute in under 10 ms. Using a Tesla T4 GPU, BERT optimized with TensorRT can perform inference in 2.2 ms for a QA task similar to available in SQuAD with batch size =1 and sequence length = 128. Using the TensorRT optimized sample, you can execute up to a batch size of 8 for BERT-base and even higher batch sizes for models with fewer Transformer layers within the 10 ms latency budget.  It took 40 ms to execute the same task with highly optimized code on a CPU-only platform for batch size = 1, while higher batch sizes did not run to completion and exit with errors.\n",
    "\n",
    "<img src=\"Figure-6-Compute-latency.jpg\">"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "17fe9312-7b20-4997-8f00-81b2d8d3ba9c",
   "metadata": {},
   "source": [
    "Next, we will perform a couple of inference benchmarks with different batch sizes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "attractive-binary",
   "metadata": {},
   "outputs": [],
   "source": [
    "BATCH_SIZE=1\n",
    "!python3 ../perf.py -e ./engines_$TRT_VERSION/bert_large_384.engine -b $BATCH_SIZE -s 384 -i 100 -w 20"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "future-courage",
   "metadata": {},
   "outputs": [],
   "source": [
    "BATCH_SIZE = 64\n",
    "!python3 ../perf.py -e ./engines_$TRT_VERSION/bert_large_384.engine -b $BATCH_SIZE -s 384 -i 100 -w 20"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "floating-museum",
   "metadata": {},
   "source": [
    "### BERT base model\n",
    "\n",
    "We repeat the same process with another BERT model, the BERT-base model (110M parameters) with sequence length 128."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "equivalent-niger",
   "metadata": {},
   "outputs": [],
   "source": [
    "!bash ../scripts/download_model.sh base 128 v2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dressed-prophet",
   "metadata": {},
   "outputs": [],
   "source": [
    "!python3 ../builder.py -m models/fine-tuned/bert_tf_ckpt_base_qa_squad2_amp_128_v19.03.1/model.ckpt -w 40000 -o engines_$TRT_VERSION/bert_base_128.engine -b $BATCH_SIZE -s 128 --fp16 -c models/fine-tuned/bert_tf_ckpt_base_qa_squad2_amp_128_v19.03.1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "handmade-contrary",
   "metadata": {},
   "outputs": [],
   "source": [
    "!python3 ../perf.py -e ./engines_$TRT_VERSION/bert_base_128.engine -b 1 -s 128 -i 100 -w 20"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
